{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Copying data from Redshift to S3 and back\n",
    "\n",
    "---\n",
    "\n",
    "---\n",
    "## Contents\n",
    "\n",
    "1. [Introduction](#Introduction)\n",
    "1. [Reading from Redshift](#Reading-from-Redshift)\n",
    "1. [Upload to S3](#Upload-to-S3)\n",
    "1. [Writing back to Redshift](#Writing-back-to-Redshift)\n",
    "\n",
    "\n",
    "\n",
    "## Introduction\n",
    "In this notebook we illustrate how to copy data from Redshift to S3 and vice-versa.\n",
    "\n",
    "### Prerequisites\n",
    "In order to successfully run this notebook, you'll first need to:\n",
    "1. Have a Redshift cluster within the same VPC.\n",
    "1. Preload that cluster with data from the [iris data set](https://archive.ics.uci.edu/ml/datasets/iris) in a table named public.irisdata.\n",
    "1. Update the credential file (`redshift_creds_template.json.nogit`) file with the appropriate information.\n",
    "\n",
    "Also, note that this Notebook instance needs to resolve to a private IP when connecting to the Redshift instance. There are two ways to resolve the Redshift DNS name to a private IP:\n",
    "1. The Redshift cluster is not publicly accessible so by default it will resolve to private IP.\n",
    "1. The Redshift cluster is publicly accessible and has an EIP associated with it but when accessed from within a VPC, it should resolve to private IP of the Redshift cluster. This is possible by setting following two VPC attributes to yes: DNS resolution and DNS hostnames. For instructions on setting that up, see Redshift public docs on [Managing Clusters in an Amazon Virtual Private Cloud (VPC)](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-vpc.html).\n",
    "\n",
    "### Notebook Setup\n",
    "Let's start by installing `psycopg2`, a PostgreSQL database adapter for the Python, adding a few imports and specifying a few configs. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!conda install -y -c anaconda psycopg2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import boto3\n",
    "import pandas as pd\n",
    "import json\n",
    "import psycopg2\n",
    "import sqlalchemy as sa\n",
    "\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "bucket='<your_s3_bucket_name_here>' # put your s3 bucket name here, and create s3 bucket\n",
    "prefix = 'sagemaker/DEMO-redshift'\n",
    "# customize to your bucket where you have stored the data\n",
    "\n",
    "credfile = 'redshift_creds_template.json.nogit'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading from Redshift\n",
    "We store the information needed to connect to Redshift in a credentials file. See the file `redshift_creds_template.json.nogit` for an example. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "# Read credentials to a dictionary\n",
    "with open(credfile) as fh:\n",
    "    creds = json.loads(fh.read())\n",
    "\n",
    "# Sample query for testing\n",
    "query = 'select * from public.irisdata;'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a connection to redshift using our credentials, and use this to query Redshift and store the result in a pandas DataFrame, which we then save."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"Reading from Redshift...\")\n",
    "\n",
    "def get_conn(creds): \n",
    "    conn = psycopg2.connect(dbname=creds['db_name'], \n",
    "                            user=creds['username'], \n",
    "                            password=creds['password'],\n",
    "                            port=creds['port_num'],\n",
    "                            host=creds['host_name'])\n",
    "    return conn\n",
    "\n",
    "def get_df(creds, query):\n",
    "    with get_conn(creds) as conn:\n",
    "        with conn.cursor() as cur:\n",
    "            cur.execute(query)\n",
    "            result_set = cur.fetchall()\n",
    "            colnames = [desc.name for desc in cur.description]\n",
    "            df = pd.DataFrame.from_records(result_set, columns=colnames)\n",
    "    return df\n",
    "\n",
    "df = get_df(creds, query)\n",
    "\n",
    "print(\"Saving file\")\n",
    "localFile = 'iris.csv'\n",
    "df.to_csv(localFile, index=False)\n",
    "\n",
    "print(\"Done\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"Writing to S3...\")\n",
    "\n",
    "fObj = open(localFile, 'rb')\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, localFile)).upload_fileobj(fObj)\n",
    "print(\"Done\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing back to Redshift\n",
    "\n",
    "We now demonstrate the reverse process of copying data from S3 to Redshift. We copy back the same data but in an actual application the data would be the output of an algorithm on Sagemaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"Reading from S3...\")\n",
    "# key unchanged for demo purposes - change key to read from output data\n",
    "key = os.path.join(prefix, localFile)\n",
    "\n",
    "s3 = boto3.resource('s3')\n",
    "outfile = 'iris2.csv'\n",
    "s3.Bucket(bucket).download_file(key, outfile)\n",
    "df2 = pd.read_csv(outfile)\n",
    "print(\"Done\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"Writing to Redshift...\")\n",
    "\n",
    "connection_str = 'postgresql+psycopg2://' + \\\n",
    "                  creds['username'] + ':' + \\\n",
    "                  creds['password'] + '@' + \\\n",
    "                  creds['host_name'] + ':' + \\\n",
    "                  creds['port_num'] + '/' + \\\n",
    "                  creds['db_name'];\n",
    "                    \n",
    "df2.to_sql('irisdata_v2', connection_str, schema='public', index=False)\n",
    "print(\"Done\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We read the copied data in Redshift - success!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pd.options.display.max_rows = 2\n",
    "conn = get_conn(creds)\n",
    "query = 'select * from irisdata3'\n",
    "df = pd.read_sql_query(query, conn)\n",
    "df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_python3)",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
